{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The autoreload extension is already loaded. To reload it, use:\n",
      "  %reload_ext autoreload\n"
     ]
    }
   ],
   "source": [
    "%load_ext autoreload\n",
    "# Import necessary libraries\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import sys\n",
    "import os\n",
    "sys.path.append(os.path.abspath('..'))\n",
    "from scripts.ab_hypothesis_testing import *"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\Tsebaot\\AppData\\Local\\Temp\\ipykernel_9060\\3627243033.py:1: DtypeWarning: Columns (32,37) have mixed types. Specify dtype option on import or set low_memory=False.\n",
      "  df=pd.read_csv('../data/MachineLearningRating_v3.csv')\n"
     ]
    }
   ],
   "source": [
    "df=pd.read_csv('../data/MachineLearningRating_v3.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Risk Differences Across Provinces\n",
    "\n",
    "### Analysis Objective\n",
    "This test examines whether there are significant differences in **risk levels** (measured by `Total Claims`) across provinces. The goal is to understand how risk varies regionally, which can inform province-specific policies or risk management strategies.\n",
    "\n",
    "### Hypotheses\n",
    "- **Null Hypothesis (H₀):** No risk differences across provinces.\n",
    "- **Alternative Hypothesis (H₁):** Risk differences exist across provinces."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ANOVA Test for Risk Differences Across Provinces:\n",
      "Null Hypothesis: No risk differences across provinces\n",
      "F-Statistic: 5.849413762407606\n",
      "P-Value: 1.6782057588675903e-07\n",
      "Result: reject the null hypothesis\n"
     ]
    }
   ],
   "source": [
    "# Test risk differences across provinces\n",
    "test_risk_differences_across_provinces(df)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### Results\n",
    "- **Test Type:** ANOVA (Analysis of Variance)\n",
    "  - Compares the mean `Total Claims` across multiple provinces.\n",
    "- **F-Statistic:** 5.849\n",
    "  - Indicates that the variance in `Total Claims` between provinces is significantly greater than the variance within provinces.\n",
    "- **p-Value:** 1.678205\n",
    "  - This value is much smaller than the common significance level of 0.05, suggesting the observed differences are highly unlikely to be due to random chance.\n",
    "- **Decision:** Reject the null hypothesis (H₀).\n",
    "\n",
    "### Conclusion\n",
    "There is strong evidence to conclude that **significant risk differences exist across provinces**.\n",
    "\n",
    "### Implications\n",
    "1. **Risk Management:** Provinces with higher average claims may require stricter risk mitigation measures, while lower-risk provinces could benefit from premium reductions.\n",
    "2. **Pricing Strategy:** Develop province-specific premium structures to reflect the risk profile of each province.\n",
    "3. **Further Analysis:** Investigate the factors contributing to risk differences, such as demographic, geographic, or economic factors.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Risk Differences Between Zip Codes\n",
    "\n",
    "### Analysis Objective\n",
    "This test examines whether there are significant differences in **risk levels** (measured by `Total Claims`) between zip codes. The goal is to evaluate how risk varies geographically at a finer level, providing insights for localized strategies.\n",
    "\n",
    "### Hypotheses\n",
    "- **Null Hypothesis (H₀):** No risk differences between zip codes.\n",
    "- **Alternative Hypothesis (H₁):** Risk differences exist between zip codes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "T-Test for Risk Differences Between Zip Codes 1000 and 2000:\n",
      "Null Hypothesis: No risk differences between zip codes\n",
      "T-Statistic: -0.6196332744567253\n",
      "P-Value: 0.5355003097164719\n",
      "Result: fail to reject the null hypothesis\n"
     ]
    }
   ],
   "source": [
    "# Test risk differences between two zip codes \n",
    "test_risk_differences_between_zipcodes(df, 1000, 2000)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Results\n",
    "- **Test Type:** T-Test (Analysis of Variance)\n",
    "  - Compares the mean `Total Claims` across multiple zip codes.\n",
    "- **T-Statistic:** -0.6196\n",
    "  - Suggests a relatively low variance in `Total Claims` between zip codes compared to within zip codes.\n",
    "- **p-Value:** 0.5355\n",
    "  - This value is greater than the common significance level of 0.05, indicating that the observed differences are likely due to random chance.\n",
    "- **Decision:** Fail to reject the null hypothesis (H₀).\n",
    "\n",
    "### Conclusion\n",
    "There is **insufficient evidence** to conclude that significant risk differences exist between zip codes. Any observed differences in `Total Claims` are likely due to random variation.\n",
    "\n",
    "### Implications\n",
    "1. **Uniform Risk Strategy:** Since risk levels appear consistent across zip codes, there may be no need for differentiated risk management or pricing strategies based on zip codes.\n",
    "2. **Further Exploration:** Investigate other variables (e.g., demographics, policy types) that may reveal meaningful differences between zip codes.\n",
    "3. **Combination Analysis:** Consider analyzing zip code risk differences alongside other factors, such as region or gender, for a more comprehensive understanding.\n",
    "\n",
    "### Next Steps\n",
    "- Visualize `Total Claims` by zip code to confirm the absence of meaningful trends.\n",
    "- Conduct subgroup analyses (e.g., zip code and gender combinations) to uncover any hidden patterns.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Margin Differences Between Zip Codes\n",
    "\n",
    "### Analysis Objective\n",
    "This test examines whether **profit margins** (calculated as `Premium - Total Claims`) differ significantly across zip codes. Profit margin reflects the profitability of policies in each zip code, making this analysis critical for identifying underperforming or highly profitable areas.\n",
    "\n",
    "### Key Difference from Risk Differences Test\n",
    "- **Test Risk Between Zip Codes:**\n",
    "  - Focuses on analyzing **Total Claims** to determine if there are significant risk differences between zip codes.\n",
    "  - Guides **risk management** strategies by identifying regions with higher or lower claims.\n",
    "- **Test Margin Differences Between Zip Codes:**\n",
    "  - Focuses on analyzing **Profit Margins** (`Premium - Total Claims`) to determine if there are significant profitability differences between zip codes.\n",
    "  - Guides **pricing and profitability optimization** by highlighting regions that are overperforming or underperforming.\n",
    "\n",
    "### Hypotheses\n",
    "- **Null Hypothesis (H₀):** No significant margin differences between zip codes.\n",
    "- **Alternative Hypothesis (H₁):** Significant margin differences exist between zip codes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "T-Test for Profit Differences Between Zip Codes 1000 and 2000:\n",
      "Null Hypothesis: No significant margin differences between zip codes\n",
      "T-Statistic: 0.4776384393714926\n",
      "P-Value: 0.6329083469646521\n",
      "Result: fail to reject the null hypothesis\n"
     ]
    }
   ],
   "source": [
    "# Test profit differences between two zip codes \n",
    "test_profit_differences_between_zipcodes(df, 1000, 2000)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Results\n",
    "- **T-Statistic:** 0.4776\n",
    "  - Indicates a smalle  variance in profit margins across zip codes compared to within-group variance.\n",
    "- **p-Value:** 0.6329\n",
    "  - This value is greater than the common significance level of 0.05, indicating that the observed differences are likely due to random chance.\n",
    "- **Decision:** Fail to reject the null hypothesis (H₀).\n",
    "\n",
    "### Conclusion\n",
    "There is **insufficient evidence** to conclude that **significant profit margin differences exist between zip codes**.\n",
    "Profits generated in these two zip codes are statistically similar.\n",
    "\n",
    "\n",
    "### Implications\n",
    "1. From a profit perspective, there is no need to differentiate strategies for these two zip codes.\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Interpretation: Risk Differences Between Genders\n",
    "\n",
    "### Analysis Objective\n",
    "This test examines whether there are significant differences in **risk levels** (measured by `Total Claims`) between genders. Understanding risk differences by gender can help insurers design gender-specific policies or adjust premiums based on claims data.\n",
    "\n",
    "### Hypotheses\n",
    "- **Null Hypothesis (H₀):** No significant risk differences between women and men.\n",
    "- **Alternative Hypothesis (H₁):** Significant risk differences exist between women and men."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "T-Test for Risk Differences Between Women and Men:\n",
      "Null Hypothesis: No significant risk differences between women and men\n",
      "T-Statistic: 0.24803623812388725\n",
      "P-Value: 0.8041073961270343\n",
      "Result: fail to reject the null hypothesis\n"
     ]
    }
   ],
   "source": [
    "# Test risk differences between Women and Men\n",
    "test_risk_differences_between_genders(df)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Results\n",
    "- **Test Type:** T-Test (Independent Samples)\n",
    "  - Compares the mean `Total Claims` between two independent groups: women and men.\n",
    "- **T-Statistic:** 0.248\n",
    "  - Indicates the magnitude of the difference between the means relative to the variability within groups.\n",
    "- **p-Value:** 0.8041\n",
    " - This value is greater than the common significance level of 0.05, indicating that the observed differences are likely due to random chance.\n",
    "- **Decision:**  Fail to reject the null hypothesis (H₀).\n",
    "\n",
    "### Conclusion\n",
    "Risk profiles are similar across genders.\n",
    "\n",
    "### Implications\n",
    "1. Gender-based differences do not appear to influence risk. Policies and strategies can remain gender-neutral for risk management purposes."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "week-3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
